ABSTRACT

A system and method for circumventing schemes that use
duplication detection to detect and block unsolicited e-mail
(spam). An address on a list is assigned to one of m sublists,
where m is an integer that is greater than one. A set of m
different messages is created. A different message from the
set of m different messages is sent to the addresses on each
sublist. In this way, spam countermeasures based upon
duplicate detection schemes are foiled.

10 Claims, 4 Drawing Sheets 
FIG. 2

FIG. 4 (graph of N_1/10, N_1/2 and N_9/10 versus m, for m from 10 to 50)



US 6,6< 

1 

SYSTEM AND METHOD FOR 
COUNTERACTING MESSAGE FILTERING 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 
This application claims the benefit of U.S. Provisional 
Application No. 60/112,998, filed on Dec. 18, 1998. 

BACKGROUND OF THE INVENTION 
The receipt of unsolicited electronic mail ("e-mail") mes- 
sages ("spam") has become a nuisance for networked com- 
puter users. In response, numerous techniques have been 
developed to detect spam and prevent it from being delivered
to its intended destination. Several known methods of
filtering spam are based upon detecting and deleting duplicate
copies of a spam message. For example, one known
method of filtering spam from more legitimate messages is 
called Filtering by Duplicate Detection (FDD). A sender of 
spam (a "spammer") typically does not know if two or more
addresses on his address list point to the same mailbox. The 
FDD method creates and maintains two or more e-mail 
addresses that point to the same mailbox. Whenever the 
same message is received more than once, it is determined 
to be spam and is deleted. Additionally, information from the 
spam can be stored (e.g., in a database) for use in identifying 
other spam (e.g., e-mail from the same sender, with the same 
subject line, etc.).

Another known method of filtering spam is called Col- 
laborative Filtering (CF). In the CF method, many users 
work together to maintain a central repository of received 
spam messages and all users' mail software checks this 
repository to see if a given message is in it; if so, the message 
is deleted from the user's mail box. The power of CF stems 
from its automatic detection of duplicate messages by the 
user's e-mail client software comparing each newly arrived 
message with the list of spam messages maintained at the 
central server. 

A third method, Manual Filtering (MF), is the most 
widely used method in the Internet today. Users of MF read 
all or part of each message and determine whether it is spam. 
Due to properties of the human visual and cognitive system, 
MF users can more easily and quickly detect a copy of a 
previously seen message than they can determine whether a 
message is spam. Thus, MF users also benefit from duplicate 
detection through increased efficiency. 

Existing approaches to solving the spam problem further
include rule-based filtering and cryptographic authentication.
See RSA Data Security; "S/MIME Central";
http://www.rsa.com/smime/; and S. Garfinkel; PGP: Pretty Good
Privacy; Sebastopol, Calif.: O'Reilly and Assoc.; 1995.
Various sendmail enhancements have also been proposed
and implemented. See B. Costales, E. Allman, & N. Rickert;
Sendmail; Sebastopol, Calif.: O'Reilly and Assoc.; 1993. See
http://www.sendmail.org/ for the latest enhancements; and see
email channels in R. J. Hall; How to avoid unwanted email;
Comm. ACM 41(3), 88-95, March 1998. These techniques
vary in effectiveness, applicability, and
practicality. For surveys of anti-spam technology, see L.
Cranor, B. LaMacchia; Spam!; to appear in Comm. ACM,
1998; http://www.research.att.com/~lorrie/pubs/spam!; and
R. J. Hall; How to avoid unwanted email; Comm. ACM 41(3),
88-95, March 1998.

Summarizing, in FDD, the idea is to maintain and pub- 
licly distribute two (or more) email addresses, both forward- 
ing to the same mailbox. An email software agent then 
automatically deletes any messages that are received more
than once. It gets its power from the fact that spammers 
(originators of spam) have no general way of telling when 
two addresses they have culled from newsgroups, web sites, 
etc., point to the same mailbox. In CF, the idea is that a group
of email users establishes a central server that maintains a
list of known spam messages; each time a new spam
message is received (and recognized as such) by some user,
that user adds it to the server's list. Then, each user employs
agent software that screens out any message appearing on
the server's list. Even MF, where the user reads and recognizes
spam messages himself, benefits from duplicate
detection, because spammers often send messages many
times to the same list; the attentive MF user will more
quickly delete second and succeeding copies, due to the
power of human visual pattern recognition.

SUMMARY OF THE INVENTION 

An embodiment of the present invention includes a system
and method for counteracting schemes for blocking
unsolicited e-mail (spam) that are based upon duplicate
detection. An e-mail recipient's address on a list is assigned
to one of m sublists, where m is an integer greater than 1.
A set of m different messages is created. A different message
of the m messages is sent to the addresses on each sublist.
In this way, spam countermeasures based upon duplicate
detection schemes are foiled.

DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a partial schematic diagram of an exemplary
system in accordance with one embodiment of the present
invention.

FIG. 2 is a partial schematic diagram of an exemplary
apparatus in accordance with one embodiment of the present
invention.

FIG. 3 is a table showing computed values of P(m, n), the
probability that all versions of an original message will be
filtered.

FIG. 4 is a graph showing a number of objects n versus
a number of slots m for given screening probabilities.

DETAILED DESCRIPTION OF THE 
INVENTION 

In accordance with an embodiment of the present
invention, the effectiveness of techniques that exploit duplicate
detection can be counteracted by partitioning the spam
address list into two or more sublists, and then sending a
different version of the message to all of the recipients on
each sublist. The object is to defeat duplicate detection by
decreasing the likelihood of receiving duplicates by, in turn,
increasing the number of message versions. For example,
list splitting defeats FDD by sending different versions of the
message to the different mailboxes representing the same
user; they will therefore not be identified as duplicates and,
hence, not deleted as spam. List splitting defeats collaborative
filtering by decreasing the likelihood that every message
version will be seen by an active user (i.e., one who actually
reports messages to the central server), and hence reported
to the central server. List splitting defeats MF when the
spammer sends the message multiple times, re-randomizing
each time, thereby making it likely that each user will
receive multiple different versions.

In accordance with one embodiment of the present
invention, a spam address list is partitioned into m sublists,
m different versions of the message are created, and a
different version of the message is sent to the recipients on
each sublist.

It is advantageous to partition an address list into m
sublists by randomly assigning each address on the list to
one of the m sublists. This advantageously helps avoid
placing two addresses that occur near each other on the
list on the same sublist. Nearby addresses on the list may
point to the same mailbox. If two such addresses appear on the
same sublist, then the same version is sent to the two
addresses, and the message is disadvantageously detectable
as spam.
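
The random assignment described above can be sketched as follows.
This is a minimal illustration only, not the implementation of any
embodiment; the function name split_list, the seed parameter and the
sample address labels (taken from FIG. 1) are assumptions made for
the example.

```python
import random


def split_list(addresses, m, seed=None):
    """Randomly assign each address to one of m sublists (m > 1)."""
    if m < 2:
        raise ValueError("m must be an integer greater than one")
    rng = random.Random(seed)  # seed is only here to make runs repeatable
    sublists = [[] for _ in range(m)]
    for address in addresses:
        # Each address independently lands in a uniformly chosen sublist.
        sublists[rng.randrange(m)].append(address)
    return sublists


# Hypothetical address labels, as in FIG. 1.
addresses = ["A1", "A2", "A3", "B1", "B2", "C1", "C2"]
sublists = split_list(addresses, 3, seed=0)
```

Because each address is placed independently, some sublists may be
empty or uneven in size; the description above does not require the
sublists to be balanced.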

Another way to partition an address list into m sublists is
to use any information available about the addresses to avoid
placing two or more addresses to the same mailbox on the
same sublist. For example, addresses that are substantially
similar (e.g., fred1@xyz.com and fred2@xyz.com) are
deliberately placed on different sublists. Identifying an
address on the list that is substantially similar to another
address can be advantageously performed by a number of
string comparison methods well known in the art. In one
embodiment of the present invention, a first address is
"substantially similar" to a second address if
at least 50% of the characters of the first address occur in
the second address in the same order as in the first address.
In another
embodiment, two addresses are substantially similar even if
they share fewer than 50% of the characters in either
address, but contain the same distinctive string. For
example, in this embodiment, the address
ASlXB4@zebra.com is "substantially similar" to
ASlXB4@phoenix.net. Even though both addresses share
only a small string in relation to their sizes, there are grounds
to suspect that they may pertain to the same user. Any
convenient metric for determining substantial similarity can
be used in accordance with the present invention, provided
such a metric is designed to identify distinct addresses that
are likely to point to the same mailbox.
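
One possible reading of the two similarity tests above is sketched
below. It is an assumption, not the metric of any embodiment: the
"same order" test is interpreted as a longest-common-subsequence
ratio, and the "distinctive string" is taken to be the part of the
address before the "@". The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def substantially_similar(first, second, threshold=0.5):
    """True if the addresses plausibly point to the same mailbox."""
    if not first:
        return False
    # Test 1 (assumed reading): at least `threshold` of the first
    # address's characters occur, in order, in the second address.
    if lcs_length(first, second) >= threshold * len(first):
        return True
    # Test 2 (assumed reading): same distinctive string, here the
    # local part before the "@".
    return first.split("@")[0] == second.split("@")[0]


same_user = substantially_similar("fred1@xyz.com", "fred2@xyz.com")
same_tag = substantially_similar("ASlXB4@zebra.com", "ASlXB4@phoenix.net")
unrelated = substantially_similar("alice@example.org", "bob@zzz.net")
```

Addresses found to be substantially similar would then be placed on
different sublists rather than assigned at random.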

A system in accordance with an embodiment of the
present invention is shown in FIG. 1. Mail transfer agent 101
is coupled to e-mail sender 102 through network 103. Mail
transfer agent 101 is coupled to user A 104, user B 105 and
user C 106. Mail transfer agent 101 receives e-mail
addressed to user A 104, user B 105 and user C 106, and then
forwards the e-mail to its intended destination. In this
example, user A 104 has three e-mail addresses, A1, A2 and
A3; user B 105 has two e-mail addresses, B1 and B2; and
user C 106 has two e-mail addresses, C1 and C2. Sender 102
stores an address list. In the embodiment shown in FIG. 1,
sender 102 stores addresses including A1, A2, A3, B1, B2,
C1 and C2. In accordance with the present invention, sender
102 creates sublists, and assigns the addresses on the list to
the sublists. In this example, the sender assigns the addresses
on the list to three sublists as follows:

Sublist 1: A1, B1
Sublist 2: A2, B2, C1
Sublist 3: A3, C2

The sender creates a different version of an e-mail message
(e.g., spam) for each sublist. Examples of differences
between versions include different source addresses, different
subject lines and variations in the body of the message,
and any combination thereof. Another method for systematic
message version generation is based on the message originator
creating a number of paragraph-variant sets, where for
one or more paragraphs in the original message, a collection
of semantically equivalent, yet syntactically different, variants
is created. Software can then generate message versions
by systematically choosing one variant from each set of
paragraph-variants to make up each version. This allows
exponentially many message versions to be created from a
small amount of spammer effort.

The sender sends a different version of the message to
each sublist. In the example shown in FIG. 1, the sender
sends a first version of the message to addresses A1 and B1
on sublist 1; a second version to addresses A2, B2 and C1
on sublist 2; and a third version to addresses A3 and C2 on
sublist 3. As shown in FIG. 1, the messages first arrive at
mail transfer agent 101.

Mail transfer agent 101 stores the
addresses for each mailbox. For example, the mail transfer
agent 101 stores information that indicates that addresses
A1, A2 and A3 indicate a single mailbox for user A 104;
addresses B1 and B2 point to a single mailbox for user B
105; and addresses C1 and C2 point to a mailbox for user C
106. Mail transfer agent 101 implements a FDD process by
detecting duplicate messages that are sent to the same
mailbox. If two or more such messages are determined to be
the same, then they are classified as spam, and are
prevented at the mail transfer agent 101 from being delivered
to their intended recipient (mailbox). For example, the
messages are deleted by the mail transfer agent 101.

By splitting the sender's address list into sublists, and
sending different versions of a message to each sublist, the
present invention advantageously circumvents the FDD
process, and spam can be successfully delivered to its
intended destination. In the example shown in FIG. 1,
comparing the sender's messages to any one mailbox will
show no duplication: three different versions are addressed
to A1, A2 and A3, respectively, and two different versions
are addressed to B1 and B2, respectively, and to C1 and C2,
respectively.
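
The FIG. 1 example can be traced with a small sketch of an FDD
process. This is an illustration under stated assumptions, not the
process of mail transfer agent 101: the table MAILBOX_OF, the function
fdd_filter and the placeholder bodies "version 1" to "version 3" are
hypothetical, and duplicates are detected by exact comparison of
message bodies.

```python
from collections import Counter, defaultdict

# Hypothetical address-to-mailbox table mirroring FIG. 1.
MAILBOX_OF = {
    "A1": "A", "A2": "A", "A3": "A",
    "B1": "B", "B2": "B",
    "C1": "C", "C2": "C",
}


def fdd_filter(deliveries):
    """Return, per mailbox, the bodies that survive duplicate detection.

    deliveries is a list of (address, body) pairs. Any body received
    more than once by the same mailbox is treated as spam and dropped.
    """
    counts = defaultdict(Counter)
    for address, body in deliveries:
        counts[MAILBOX_OF[address]][body] += 1
    return {
        mailbox: sorted(body for body, seen in bodies.items() if seen == 1)
        for mailbox, bodies in counts.items()
    }


# One identical message to every address: every mailbox sees duplicates.
unsplit = [(address, "version 1") for address in MAILBOX_OF]
blocked = fdd_filter(unsplit)

# The three sublists of FIG. 1, one version per sublist.
split = [
    ("A1", "version 1"), ("B1", "version 1"),
    ("A2", "version 2"), ("B2", "version 2"), ("C1", "version 2"),
    ("A3", "version 3"), ("C2", "version 3"),
]
delivered = fdd_filter(split)
```

With the unsplit list every mailbox ends up empty, while with the
FIG. 1 sublists no mailbox receives the same body twice, so nothing
is filtered.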

An embodiment of the present invention is just as effec- 
tive when the FDD method is implemented at the user (e.g., 
at user A 104, user B 105 and user C 106), rather than at an 
intermediary between the sender and the user, such as a mail 
transfer agent 101. 

List splitting is also an effective countermeasure against 
the collaborative filtering method. Every time a variant of an 
electronic message is generated and sent by a spammer, the 
new variant is not generally identified as belonging to a set 
of spam messages stored in a central repository. For 
example, a message identified as spam is stored with a body 
that includes the text "SUBSCRIBE NOW TO TELCO 
SERVICES 9¢ PER MINUTE LONG DISTANCE." A new
message that is a linguistic variant of the stored message is 
received at a mailbox. For example, the new message body 
includes, "We are offering the cheapest calling anywhere." 
The new message is compared to messages in the central 
repository, but no match is found because of the linguistic 
variation between the new and old messages, even though 
their meaning is substantially the same. The collaborative 
filtering method advantageously fails to detect that the new 
message generated in accordance with an embodiment of the 
present invention is spam. 

Likewise, list splitting is also effective against manual 
filtering. A user is more likely to recognize and delete a 
duplicate of a spam message before reading very much of it.
The variant versions of the same message sent in accordance 
with an embodiment of the present invention at least dimin- 
ish the advantage of prior experience in detecting and 
deleting spam. For example, a user is more likely to read 
more of a second message that states, "We are offering the 
cheapest calling anywhere" after having previously read a
first message that states "SUBSCRIBE NOW TO TELCO
SERVICES 9¢ PER MINUTE LONG DISTANCE" than he
is to read a second, duplicate occurrence of the first message. 

An apparatus in accordance with an embodiment of the 
present invention is shown in FIG. 2. The embodiment 
includes a processor 201, a memory 202 and a port 203. The 
processor in one embodiment is a general purpose 
microprocessor, such as the Pentium II processor manufac- 
tured by the Intel Corporation of Santa Clara, Calif. In 
another embodiment, the processor 201 is an Application 
Specific Integrated Circuit (ASIC), which has been designed 
to perform in hardware and firmware at least part of the 
method in accordance with an embodiment of the present
invention.

Memory 202 is any device adapted to store digital 
information, such as Random Access Memory (RAM), flash 
memory, a hard disk, an optical digital storage device, any 
combination thereof, etc. As shown in FIG. 2, memory 202 
is coupled to processor 201. The term "coupled" means 
connected directly or indirectly. Thus, A is "coupled" to C if 
A is directly connected to C, and A is "coupled" to C if A is 
connected directly to B, and B is directly connected to C. 

Memory 202 stores filter countermeasure instructions 204 
that are adapted to be executed by processor 201. The term 
"adapted to be executed" is meant to encompass any instruc- 
tions that are ready to be executed in their present form (e.g., 
machine code) by processor 201, or require further manipu- 
lation (e.g., to be compiled, decrypted, be provided with an 
access code, etc.) to be ready to be executed by processor 
201. Filter countermeasure instructions 204 are adapted to 
be executed by processor 201 to perform the method in 
accordance with an embodiment of the present invention. 
For example, filter countermeasure instructions 204 are 
adapted to be executed by processor 201 to divide a list into 
m sublists, generate m different versions of a message, and 
send a different version to each sublist. A version is said to 
be "sent to a sublist" when it is sent to at least one address 
on the sublist. Memory 202 also stores a list and a sublist in 
one embodiment. Memory 202 is meant to encompass 
separate digital storage devices (e.g., a database, a remote
hard disk, etc.); in other words, memory 202 encompasses 
distributed memory. Thus, filtering countermeasure instruc- 
tions can be stored in one memory device, while a list can 
be stored on another memory device accessed by processor 
201 through a network via port 203, and a sublist can be 
stored on yet another memory device. In FIG. 2, an address 
list data structure 205 is shown stored in memory 202. In this 
embodiment, the list data structure 205 includes a set of 
addresses of prospective recipients of messages, e.g., a spam 
address list. FIG. 2 also shows memory 202 storing a 
plurality of sublist data structures 206. Each sublist data 
structure of the plurality of sublist data structures 206 
includes a subset of the addresses that are included in the 
address list data structure 205. Port 203 is adapted to be 
coupled to a network. Port 203 is also coupled to processor 
201. 

In accordance with one embodiment of the present 
invention, filtering countermeasure instructions are stored 
on a medium and distributed as software. The medium is any 
device adapted to store digital information, and corresponds 
to memory 202. For example, a medium is a portable 
magnetic disk, such as a floppy disk; or a Zip disk, manu- 
factured by the Iomega Corporation of Roy, Utah; or a 
Compact Disk Read Only Memory (CD-ROM) as is known 
in the art for distributing software. The medium is distrib- 
uted to a user that has a processor suitable for executing the 
filtering countermeasure instructions, e.g., to a user with a
server having a processor, memory and a port adapted to be 
coupled to a network. 

A more rigorous, analytical treatment of how the list-splitting
countercountermeasure family, LS(m), works is as
follows. The LS(m) spammer wants to send a message of
given semantic content to as many members of a mailing list
as possible. The spammer first creates m equivalent but
non-identical versions of the message, then randomly
assigns each address in the list one of the m versions, and
finally sends each address its assigned version. Intuitively,
this decreases the effectiveness of duplicate detection techniques
because there are far fewer pairs of duplicate messages
to be detected, as long as different message versions
are not easily detectable as such. Later, we analyze the
effectiveness of LS assuming this condition is true.

Prior to that, however, how realistic is it to assume that m
distinct versions can be systematically generated? First of
all, it is straightforward to apply simple syntactic variations
to a message in such a way that the semantic content of the
message is unchanged, using techniques analogous to those
of digital document marking; see J. Brassil, S. Low, N. F.
Maxemchuk, L. O'Gorman; Electronic marking and identification
techniques to discourage document copying; IEEE
J. Selected Areas in Communications 13(8), 1495-1504,
October 1995. One can add or remove whitespace
characters, change capitalization or punctuation, and add or
remove banners and other peripheral information.

However, for any set of simple syntactic variations, one
can also imagine a clever automatic duplicate detector that
might not be fooled. More problematic for automatic
detection, however, would be the method of linguistic variants:
while composing the original message, select k of the
paragraphs and compose two linguistically completely
different, yet practically semantically equivalent, paraphrases
for each. For example, one paragraph might be

  Jane Doe, of East Nowhere, Maine, writes "My whole life
  was changed when I joined the team. My looks
  improved, I met people with better cars, and several
  thousand dollars per day arrived in my mail box."

The following paragraph might well be equivalent for the
spammer's purposes:

  Marvin Smith, of Central Prairie, Montana, writes "I
  couldn't believe how much happier I became after
  signing on! The extra money allowed me to pursue my
  childhood dreams."

Once the k paragraph pairs are composed, it is simple to
automate 2^k message variations, by systematically choosing
one variant from each pair in all possible ways. Moreover,
in order to avoid duplicate detectors that look for large
percentage overlaps, one can use 2k paragraph pairs and use
coding techniques that allow 2^k widely different message
variations. Reordering the paragraphs may help as well,
when that is possible semantically.
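
The systematic choice of one variant from each pair can be sketched
as a Cartesian product. This is a minimal illustration; the function
name message_versions and the placeholder paragraph texts are
assumptions, and the coding techniques for widely different
variations mentioned above are not shown.

```python
from itertools import product


def message_versions(paragraph_sets):
    """Build every message obtainable by picking one variant per set.

    paragraph_sets is a list of lists; each inner list holds
    interchangeable variants of one paragraph.
    """
    return ["\n\n".join(choice) for choice in product(*paragraph_sets)]


# k = 3 paragraph pairs with placeholder text.
paragraph_sets = [
    ["Greetings.", "Hello there."],
    ["Offer text, wording one.", "Offer text, wording two."],
    ["Reply today.", "Write back soon."],
]
versions = message_versions(paragraph_sets)  # 2**3 = 8 versions
```

With k pairs the product has 2^k elements, which is why the human
effort grows only logarithmically in the number of versions.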

Another obfuscating technique is to vary the text at the
word level by replacing selected words with synonyms; e.g.,
fantastic ↔ marvelous, etc. Thus, one can evade even duplicate
detectors that look for long common phrases.

Using only linguistic variants, increasing the number of
versions m requires only a logarithmic effort. The other
techniques mentioned require no significant human effort.
These techniques (and others) show it is quite reasonable to
assume a spammer can systematically generate large numbers
of practically undetectable variations of a message at
very low cost.

We now consider three anti-spam technologies based on
duplicate detection, and quantitatively analyze the effectiveness
of LS (list splitting) against each.

In Filtering by Duplicate Detection (FDD), the user
maintains some number n of different addresses that all
forward to one and the same mailbox, as well as a software
agent that automatically detects and deletes any message
received more than once. Friends and other legitimate correspondents
will send to only one address, so their messages
will not be multiply received and hence deleted. On the other
hand, for newsgroup postings and web page mailtos
(primary sources from which spammers collect mailing lists
as discussed in L. Cranor, B. LaMacchia; Spam!; to appear
in Comm. ACM, 1998; http://www.research.att.com/~lorrie/pubs/spam!)
one either provides all the addresses, assuming
the spammer's collection tools can't detect this, or else
randomly switches among the addresses. This makes it
likely spammers will pick up all of the addresses in their
automatic collection processes. Information about FDD can
be found in B. K. Sherman, personal communication, 1998.

Detecting Duplicate Messages. One needs to be a bit 
clever in detecting duplicate messages. Header lines will 
vary quite a bit when the same message is sent via two 
different addresses. Header information such as "received-by,"
"to," "date," and message IDs can be quite different.
Also, since messages can take widely different amounts of 
real time to travel the different routes, the question arises of 
how long to hold a message (waiting for possible duplicates) 
before presenting it to the user. These subtleties are resolved 
adequately by comparing message bodies only, and by doing 
the filtering at regular intervals, such as once per day when 
logging in. 

Advantages and Disadvantages. Compared to other anti- 
spam techniques, FDD is relatively practical and usable. 
One need not constantly maintain complex filtering rule sets; 
one need not manage keys, certificates, and trust policies; 
and one doesn't even classify messages as spam or nonspam. 
The only slight complication is easily distributing all the 
addresses when posting to newsgroups and filling out 
product-registration cards. The use of signature files helps 
with this problem. 

It is not without usability disadvantages, however. While 
many organizations offer forwarding accounts for free, the 
typical user may have to pay for most of his forwarding 
addresses. Some messages that the user wishes to see, such 
as conference and talk announcements, may be received 
multiple times due to overlapping interest distribution lists 
or multiple reminders. These are filtered out. Of course, if 
two copies are both addressed to the same one of the user's 
addresses, then a clever filter can choose not to delete them. 
Also, the mail server must handle and possibly download n 
copies of each spam message, increasing the user's and 
ISP's costs. Finally, the need for delaying message delivery 
in order to see if duplicates arrive will bother those users 
wishing to receive their messages as soon as possible. 

We now consider FDD vs LS(m). The probability of a
2-address FDD user successfully filtering a message is 1/m,
where m is the number of slots (sublists). On the other hand,
adding an address tends to increase the screening probability 
for fixed m. In an arms race of user versus spammer, we can 
expect to see changes in both m and n, so we need to analyze 
the probability variation for all m and n. 

Definition. Suppose a user maintains n email addresses, 
and a spammer randomly (uniformly) assigns each address 
to one of m sublists, sending a distinct version of an original 
message to each of the sublists. Define P(m,n) to be the 
probability that FDD will successfully filter out (all versions 
of) the original message. 

More abstractly, this is the same as the probability of
randomly choosing an assignment of n (distinct) objects to
m slots in such a way that none of the m resulting slot-sets
is a singleton; the message fails to reach the user if and only
if every message variation goes to either zero or at least two
of the user's n addresses.
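
For small m and n the definition can be checked directly by
enumerating all m^n assignments. The sketch below is illustrative
only; the function name p_bruteforce is an assumption, and the
enumeration is practical only for small inputs.

```python
from fractions import Fraction
from itertools import product


def p_bruteforce(m, n):
    """P(m, n) by enumeration: the fraction of assignments of n
    addresses to m slots in which no slot holds exactly one address."""
    no_singleton = 0
    for assignment in product(range(m), repeat=n):
        counts = [0] * m
        for slot in assignment:
            counts[slot] += 1
        if 1 not in counts:
            no_singleton += 1
    return Fraction(no_singleton, m ** n)


two_addresses = p_bruteforce(5, 2)    # both addresses must share a slot
three_by_three = p_bruteforce(3, 3)   # all three must share one slot
```

For n = 2 the only singleton-free assignments put both addresses in
the same slot, which gives the value 1/m stated earlier.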

Consider how to calculate P(m,n). In order to find a
formula for P, it is useful to count the number of assignments
that leave at least one slot (subset) a singleton. In particular,
let us define a family of approximations to this as follows.
Let S_1(m,n) be the number of assignments leaving slot #1
a singleton. Let S_2(m,n) be the number of assignments
leaving at least one of slots #1 and #2 a singleton. For any
k ≤ m, let S_k(m,n) be the number of assignments leaving at
least one of slots #1 . . . #k a singleton. Then, since there are
m^n total assignments, we can define an approximation
P_k(m,n) to P as

Definition.
\[ P_k(m,n) = 1 - \frac{S_k(m,n)}{m^n} \]

Now, from the definition, S_k(m,n) ≤ S_{k+1}(m,n) for all
1 ≤ k < m, because every assignment leaving a singleton in
one of the first k slots also leaves a singleton in one of the
first k+1 slots. It follows, therefore, that

\[ P_1(m,n) \ge P_2(m,n) \ge \cdots \ge P_m(m,n) = P(m,n). \]

The latter equality holds because a message is filtered iff
there is not a singleton among the (first/all) m slots.
Theorem FDD1. The number of assignments of n things
to m slots leaving a singleton in at least one of the first k
slots is

\[ S_k(m,n) = \sum_{j=1}^{k} (-1)^{j-1} \binom{k}{j}\, n^{\underline{j}}\, (m-j)^{n-j} \]

where, following the notation of Graham et al. in R. Graham, D.
Knuth, O. Patashnik; Concrete Mathematics: A Foundation
for Computer Science; Reading, Mass.: Addison-Wesley;
1989, 1994,

\[ a^{\underline{j}} = \begin{cases} 1 & \text{if } j = 0 \\ a(a-1)(a-2)\cdots(a-j+1) & \text{if integer } j > 0 \end{cases} \]

Proof. See Appendix A.1, at the end of the Detailed
Description.

Substituting this formula into the definition of P_k, and setting
k=m,

Corollary FDD2.
\[ P(m,n) = 1 - \frac{1}{m^n} \sum_{j=1}^{m} (-1)^{j-1} \binom{m}{j}\, n^{\underline{j}}\, (m-j)^{n-j} \]

It is simple to write a program to compute values of P
given values for m and n, in a language providing exact
bignum arithmetic, such as Common Lisp. The table shown
in FIG. 3 gives computed values of P to three decimal places
for low values of m and n. Note that proceeding from left to
right in a row, we do not have monotonic increase, as we
might expect from the fact that the "average number of
elements per slot" goes up. However, as n → ∞ for fixed m,
the probability that any particular slot will have exactly 1 out
of n addresses assigned to it goes to zero. Therefore,
P(m,n) → 1 for fixed m as n → ∞, so a user can keep the screening
probability high by increasing n.
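
A program of the kind mentioned above might look like the following
sketch. It is written in Python rather than Common Lisp and is not
the program used to produce FIG. 3; it assumes the inclusion-exclusion
sum of Theorem FDD1 with falling factorials, takes 0^0 = 1, and uses
hypothetical function names. Exact rational arithmetic stands in for
bignum arithmetic.

```python
from fractions import Fraction
from math import comb


def falling(a, j):
    """Falling factorial a(a-1)...(a-j+1); equals 1 when j = 0."""
    result = 1
    for i in range(j):
        result *= a - i
    return result


def s_k(k, m, n):
    """Assignments of n things to m slots that leave a singleton in at
    least one of the first k slots (inclusion-exclusion over j slots)."""
    # Terms with j > n vanish because the falling factorial is zero,
    # so the sum stops at min(k, n); Python evaluates 0**0 as 1.
    return sum(
        (-1) ** (j - 1) * comb(k, j) * falling(n, j) * (m - j) ** (n - j)
        for j in range(1, min(k, n) + 1)
    )


def p_k(k, m, n):
    """Approximation P_k(m, n) = 1 - S_k(m, n) / m^n."""
    return 1 - Fraction(s_k(k, m, n), m ** n)


def p(m, n):
    """Screening probability P(m, n) = P_m(m, n)."""
    return p_k(m, m, n)


example = p(3, 8)  # exact rational value for m = 3, n = 8
```

Calling float(p(m, n)) gives a decimal value that can be rounded to
three places for comparison with a table such as FIG. 3.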

Estimating N_p(m): How high must n be set for a given m to
have a given screening probability?

Definition. Let N_p(m) be the minimum n such that
P(m,n) ≥ p.

Note that as long as m ≤ 1/p, N_p(m)=2; this follows from
the fact that P(m,2)=1/m as noted previously. Using the
program that produced the data for the table shown in FIG.
3, it is straightforward to compute N_p(m) for somewhat
larger m. FIG. 4 shows a graph of N_1/10, N_1/2, and N_9/10
versus m. For example, if m=10, a user must maintain 41
addresses in order to have a 50-50 chance of filtering the
message; if m=20, N_1/2 is 100. The "slope" of N_1/2 is
increasing, indicating superlinear behavior; however, the
"third derivative" of N_1/2 appears to be negative: the
"slope" is roughly 5 at m=5, 6 at m=12, and 7 not until
m=22. Is it possible N_p eventually slows down?
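
The definition of N_p(m) suggests a direct search over n, sketched
below. This is an assumption-laden illustration, not the program
behind FIG. 4: it recomputes P(m,n) from the inclusion-exclusion
formula, and the function names and the search limit n_max are
hypothetical.

```python
from fractions import Fraction
from math import comb


def screening_probability(m, n):
    """P(m, n): probability that no slot is a singleton."""
    total = 0
    for j in range(1, min(m, n) + 1):
        falling = 1
        for i in range(j):
            falling *= n - i  # falling factorial n(n-1)...(n-j+1)
        total += (-1) ** (j - 1) * comb(m, j) * falling * (m - j) ** (n - j)
    return 1 - Fraction(total, m ** n)


def n_p(p, m, n_max=1000):
    """Smallest n with P(m, n) >= p, searching n = 1 .. n_max."""
    for n in range(1, n_max + 1):
        if screening_probability(m, n) >= p:
            return n
    raise ValueError("no n up to n_max reaches the requested probability")


half = Fraction(1, 2)
small_cases = [n_p(half, m) for m in (2, 3)]
```

The same function can be called with m=10 or m=20 to compare against
the values of N_1/2 quoted above. Because P(m,n) is not monotonic in
n, the search must start from the smallest n rather than bisect.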

As a practical matter, it is important to consider a more
basic question first: does N_p ever decrease? If it did, then it
would make list-splitting more difficult to use, since one
would have to check when increasing m whether one had
actually decreased N_p for some p. However, the following
theorem shows this cannot happen.



Theorem FDD4.
\[ \forall\, 0 \le p \le 1,\ \forall m \ge 1,\quad N_p(m+1) \ge N_p(m). \]



The proof of this theorem is shown in Appendix A.2 at the 
end of the Detailed Description. 

Now, regarding the question of whether N_p slows down,
since I know of no closed form for P(m,n), I will take an
indirect approach to estimating its growth rate. First, I will
show that N_p(m) grows at least as fast as a linear function of
m.

Theorem FDD6.
\[ \forall\, 0 \le p \le 1,\ \forall m > 0,\quad N_p(m) > pm. \]

The proof of this theorem is shown in Appendix A3 at the 
end of the Detailed Description. 

Theorem FDD6 allows us to conclude that N_p(m) is at
least linear in m. However, I will show that the limit as m
goes to infinity of P(m,N_p(m)) is zero for all linear functions
N_p(m). Thus, N_p(m) must grow faster than any linear
function, because P(m,N_p(m)) ≥ p > 0 for all m.



messages globally available to all users. When a user's mail 
software receives a new message, it checks it against the 
server's list; if found, it is discarded. Thus, in the best case, 
shortly after the first user reports a spam message, no other 
users are bothered by that message. Clearly, if every user 
reports every spam message seen, then only one user sees 
each message. (For this analysis, we ignore the geographic 
distribution of users which may lead to other users seeing the 
message before the first user's report reaches the server.) 

Advantages and Disadvantages. This approach has the 
advantage that the effort per user per message is very small 
in the best case. Moreover, it is pretty accurate and doesn't 
depend on managing keys, certificates, trust, multiple 
addresses or other side information. However, it has a few 
disadvantages as well. It requires the active participation of 
as many users as possible in not just deleting spam, but also 
forwarding it to the server. It has the annoying property that
other, possibly unknown, users decide what is spam for a 
given user. This could lead to users missing messages that 
they wish to see, because another user deemed it spam. 
Finally, it is susceptible to abuse by politically motivated or 
otherwise malicious users in blacklisting messages they do 
not wish others to see. 

CF vs LS(m) 

Not every user will report every message in a timely 
manner. Users may be away from email, lack the time, 
forget, or just not feel like it. To model this, let us define the
active set of users (over a given period of interest) to be the
set of those users who do report spam in a timely fashion.
Let A denote the cardinality of the active set, while L denotes
the total number of users. Let a=A/L.

Now, suppose the spammer splits the list m ways. If any
one of the m versions fails to reach an active user, then L/m
users will receive the message. Suppose we wish to maintain
a probability at least p that no such partial failure occurs. Let
$A_p(m)$ be the minimum value of A required to have probability
at least p that no such failure occurs.



$A_p(m) = L\left(1 - (1 - p^{1/m})^{m/L}\right)$. Theorem CF1.



The proof of this theorem is provided in Appendix B at the 
end of the Detailed Description. 

To get an idea of how this grows with m, we can show 



The proof of this theorem is shown in Appendix A.3 at the
end of the Detailed Description. 

In conclusion, LS seems to render FDD largely ineffective.
This is because for a given splitting number, m, the user
must maintain impractically many addresses. For example,
if m=8 (requiring only three two-version paragraphs to
systematically generate the message versions), users must
maintain at least 31 addresses to have a 50-50 chance of
screening the message (as shown by the graph in FIG. 4).
The analysis above shows that the minimum n required for
any given nonzero screening probability increases faster
than any linear function of m. This gives a decisive advantage
to the spammers, since it is easier for them to increase
the number of slots/versions than it is for typical users to add
enough addresses to keep the screening probability acceptably
high.
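The screening probability discussed above can also be estimated by simulation. The following is a rough sketch under stated assumptions, not the analysis used in this description: it assumes a user with n addresses, a spammer who assigns each address independently and uniformly to one of m slots, and that the message is screened only when no slot holds exactly one of the user's addresses. The function names, the trial count, and the seed are all hypothetical choices.

```python
import random
from collections import Counter


def is_screened(m: int, n: int, rng: random.Random) -> bool:
    """Simulate one mailing: True if no slot holds exactly one of the n addresses."""
    slots = Counter(rng.randrange(m) for _ in range(n))
    return all(count != 1 for count in slots.values())


def estimate_screen_probability(m: int, n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability that FDD screens the message."""
    rng = random.Random(seed)
    hits = sum(is_screened(m, n, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    # The m=8, n=31 case mentioned in the text.
    print(estimate_screen_probability(8, 31))
```

A simulation of this kind trades exactness for speed, which may be useful when m and n are too large for exhaustive counting.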

Collaborative Filtering 



The proof of this theorem is provided in Appendix B at the 
end of the Detailed Description. 

The coefficient of m in Theorem CF2 increases monotonically
in m. Thus, the number of active list members needs
to increase with the number of message versions. To guarantee
p at least 0.9, the table below shows the approximate
required values of $A_p$ (using the approximation of Theorem
CF2, which is independent of L, for simplicity).

m       A_p
1024    9402
2048    20224
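The table entries can be checked numerically. The sketch below assumes that Theorem CF1 has the form $A_p(m) = L(1 - (1 - p^{1/m})^{m/L})$ and that the approximation of Theorem CF2 has the form $A_p(m) \approx m \ln(1/(1 - p^{1/m}))$, which is what the proof in Appendix B appears to derive; the function names and the sample value of L are hypothetical.

```python
import math


def active_users_exact(p: float, m: int, L: int) -> float:
    """A_p(m) under the assumed CF1 form: L * (1 - (1 - p**(1/m))**(m/L))."""
    one_minus_root = -math.expm1(math.log(p) / m)  # 1 - p**(1/m), computed stably
    return -L * math.expm1((m / L) * math.log(one_minus_root))


def active_users_approx(p: float, m: int) -> float:
    """A_p(m) under the assumed CF2 form: m * ln(1 / (1 - p**(1/m)))."""
    one_minus_root = -math.expm1(math.log(p) / m)
    return m * math.log(1.0 / one_minus_root)


if __name__ == "__main__":
    for m in (1024, 2048):
        print(m, round(active_users_approx(0.9, m)),
              round(active_users_exact(0.9, m, 10**7)))
```

Under these assumptions the approximation reproduces the two table rows above, and the exact form stays slightly below it when m is much smaller than L.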



Clearly, if m is not much smaller than L, then collaborative
filtering is of little benefit, since at least m list members (now
a significant fraction) must see a version of the message. On
the other hand, as long as $m \ll L$, the arms race also favors
the spammer, since convincing users to become more active
figures to be much harder than it is for the spammer to (e.g.)
double m by creating one more paragraph variant-pair. In
conclusion, it would appear that collaborative filtering loses
the arms race against a list-splitting spammer.

Manual Filtering 

Manual Filtering (MF) denotes the fully manual process
where the user reads some portion of each message, judges
whether it is spam, and then deletes spam messages. This is
by far the most widely used method of dealing with spam.

Advantages and Disadvantages. About the only advantage 
of MF is its accuracy. It is presumably the most accurate 
approach, since the individual makes the judgement himself, 
with no intervening program introducing the possibility of 
error. It is not perfect, however, due to fatigue, carelessness, 
and fallibility of the human user. Its primary disadvantages
include its labor intensity and fundamental unfairness to the
user: virtually all the costs (time, money, information 
overload) are borne by the user and/or his service provider, 
whereas the spammer receives any profits or other benefits 
generated by the bulk mailing. 

MF vs LS(m) 

If a spammer sends each message only once to the list,
there is no advantage to him in using LS, since each user just
sees one message. However, if the spammer wishes to send
a message multiple times to the list, as seems to be fairly
common practice currently, then LS does benefit the spammer
by decreasing the effectiveness of MF. This is because
humans appear to be good at recognizing when a message
duplicates a previously seen message, so when using MF,
duplicate spam messages are recognized sooner and hence
deleted relatively quickly (without further cognitive
consideration). Multiple versions of the same message, by
contrast, will each be given cognitive consideration.

For the succeeding analysis, I will assume it is significantly
better for the spammer if the user reads more versions
of the message than fewer; this presumably increases the 
possibility of the user acting favorably on the message. (This 
appears to be the reasoning motivating current spammers to 
send the same message multiple times.) On the other hand, 
I will assume that immediate deletion of an identical copy 
does not confer this advantage. Thus, the game is for the 
spammer to trick each user into reading and considering the 
message contents as many times as possible. 

Theorem MF1. For all $m \ge 1$, $k \ge 1$, if an LS(m)-spammer
sends k times to the list (i.e., sends the same set of message
versions k times, but randomly reassigns the addresses to the
m slots each time), then each user receives an average of

$$E(m,k) = m\left(1 - \left(\frac{m-1}{m}\right)^{k}\right)$$

distinct message versions. The proof of this theorem is
shown in Appendix C at the end of the Detailed Description.

By differentiating E(m,k)/k with respect to k, one can
show that it decreases for $1 \le k \le m$. Therefore, E(m,k)/k is
greater than or equal to E(m,m)/m, which by Lemma FDD8
is approximately $1 - 1/e \approx 0.632\ldots$ for large enough m. Thus,
the spammer can be assured of getting at least about 0.632k
message considerations per MF user in $k \le m$ rounds
where, by hypothesis, the non-LS spammer gets only 1. Note
also that for $k \ll m$, as would be typical, E(m,k)/k is very
close to 1, hence E(m,k) is close to k even for moderate m.
Clearly, manual filtering suffers when the spammer uses list
splitting and resends messages to the list.
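The average number of distinct versions can be checked for small cases. The sketch below assumes the closed form $E(m,k) = m(1 - ((m-1)/m)^k)$ and compares it against a brute-force average over all $m^k$ equally likely version sequences; the function names are hypothetical and the brute-force routine is only practical for small m and k.

```python
from fractions import Fraction
from itertools import product
import math


def expected_distinct(m: int, k: int) -> Fraction:
    """Closed form assumed for E(m, k): m * (1 - ((m - 1) / m) ** k)."""
    return m * (1 - Fraction(m - 1, m) ** k)


def expected_distinct_bruteforce(m: int, k: int) -> Fraction:
    """Average count of distinct versions over all m**k version sequences."""
    total = sum(len(set(seq)) for seq in product(range(m), repeat=k))
    return Fraction(total, m ** k)


if __name__ == "__main__":
    print(expected_distinct(3, 2), expected_distinct_bruteforce(3, 2))
    # Ratio E(m, m) / m for a large m, next to the limit 1 - 1/e.
    print(float(expected_distinct(1000, 1000)) / 1000, 1 - 1 / math.e)
```

The second printed line illustrates the 0.632 lower bound on considerations per round quoted above.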

Note that if the spammer wished to defeat only MF, he
would have no need to re-randomize the list each time.
Simply cyclically permuting the message versions among
the sublists each time would achieve slightly more message
considerations (exactly k instead of almost k per user).
However, this approach would be worse against FDD,
because exactly the same set of users would get the message
each time, whereas by re-randomizing each time, the spammer
will evade the FDD filters of a different subset of FDD
users each time. Similar reasoning applies to make randomizing
slightly more effective against CF as well. This choice
is obviously a trade-off that must be made based on the
expected mix of techniques used by the target user population.
Impact on Other Anti-Spam Technologies 

A technique related to collaborative filtering that is
reported to be in use by some ISPs is to detect when some
fraction f of users all receive the same message; the message
is then examined by a human to see if it is spam. This
checking is most likely done by examining only the recipient
lists in the SMTP envelope of the message, since it is
obviously impractical to compare all pairs of the millions of
messages received daily. If a list-splitting spammer sets m
greater than 1/f, then no message version is likely to go to
enough users to be flagged as potential spam. And yet, ISPs
cannot set f too low, as there are legitimate reasons to send
messages to moderate-sized sets of users.

Another anti-spam technique is to require a valid return
address on any incoming message by verifying a valid DNS
entry before delivering the message to the users. See B.
Costales, E. Allman, & N. Rickert; Sendmail; Sebastopol,
Calif.: O'Reilly and Assoc; 1993. See http://www.sendmail.org/
for the latest enhancements. List-splitting potentially multiplies
the number of verifications that need to be performed
by a factor of m, if during version generation the return
address is varied. This is true even if the verifying server
caches verified addresses. Whether this is a significant
impact depends on the size of m and on the cost of the
verification process.

Rule-based filtering depends on users (or administrators)
maintaining a set of recognition rules to detect spam. While
such rule sets can be arbitrarily powerful, some classes of
rule set are vulnerable to LS. For example, one approach
applies information retrieval tools (see G. Salton, Ed.; The
SMART Retrieval System: Experiments in Automatic Document
Processing; Englewood Cliffs, N.J.: Prentice-Hall;
1971) to discover information-carrying words that are highly
indicative of spam content. However, nothing prohibits
spammers from sampling email streams and running such
tools themselves. By incorporating synonym techniques into
message version generation, they can use LS to alter the
statistical properties of their messages to avoid such rule sets.
In another approach, humans construct rules that are abstractions
of previously reported spam messages. This is equivalent
to CF (hence vulnerable to LS) unless the abstractions
capture other, unseen message versions as well. This
depends on the skill and speed of the human rule set
maintainers.


Various approaches are based on verifying the message
sender via public key cryptography. See RSA Data Security;
"S/MIME Central"; http://www.rsa.com/smime/; S.
Garfinkel; PGP: Pretty Good Privacy; Sebastopol, Calif.:
O'Reilly and Assoc; 1995. One approach is for users to
accept only messages signed with a pre-approved key. This
approach is impervious to list-splitting, since presumably
one won't approve the key of a spammer. However, this
approach is restrictive, since it doesn't allow one to receive
email from mailing lists or any sources not pre-approved by
the user. A less restrictive approach is to maintain a list of
known spammers' keys and accept all signed messages
except those signed by spammer keys. This approach is
impervious to list-splitting as described thus far; however,
by maintaining a large set of valid public keys, a spammer
can create alternative message versions signed by different
keys.

By increasing m, the spammer can succeed even when
users collaboratively post lists of spammer keys. Of course,
if the cost to register a key is significant, it may be too costly
for spammers to generate m new keys for each message. It
is not clear what the cost of registering a key will be in the
future, however.

By contrast, the email channels approach (see R. J. Hall;
How to avoid unwanted email; Comm. ACM 41(3), 88-95,
March 1998) exploits the simple idea that spammers must
know a valid address in order to successfully send email to
a user. The user is provided with a transparent way of
allocating and deallocating different addresses for use by
distinct correspondents. Thus, if a spammer obtains one
address for a user and sends a message to it, the user can
simply close the channel and all subsequent messages are
bounced by the server at the protocol level before the
message data are even transferred. Because this approach is
not dependent on message content, it is completely impervious
to list-splitting.

Thus, anti-spam techniques based on the various forms of
duplicate detection are useful only as long as spammers
don't use the list-splitting countercountermeasure, because
the LS-spammer has a powerful advantage in the arms race.
I believe the anti-spam research and development communities
should focus attention instead on the techniques that
are impervious to list splitting, such as cryptographic techniques
and the email channels approach.


Appendix A: FDD-related Proofs 
A.1 Computing P(m, n)

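Theorem FDD1. The number of assignments of n things
to m slots leaving a singleton in at least one of the first k
slots is

$$S_k(m,n) = \sum_{i=1}^{\min\{k,n\}} (-1)^{i+1} \binom{k}{i}\, n^{\underline{i}}\, (m-i)^{n-i},$$

where, following notation of Graham et al in R. Graham, D.
Knuth, O. Patashnik; Concrete Mathematics: A Foundation
for Computer Science; Reading, Mass.: Addison-Wesley;
1989, 1994,

$$n^{\underline{j}} = \begin{cases} 1 & \text{if } j = 0 \\ n(n-1)\cdots(n-j+1) & \text{if } j > 0. \end{cases}$$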

Proof. When k=1, we must first assign any one of the n
things to slot 1 and then assign the other n-1 to the other
m-1 slots in all $(m-1)^{n-1}$ ways. Thus, $S_1(m,n) = n(m-1)^{n-1}$.
Proceed by induction, assuming the theorem true for all
$1 \le i \le k-1$. Any assignment having a singleton in at least one
of the first k slots either (a) has a singleton in at least one of
the first k-1 slots (possibly one in slot k as well), or else (b)
has slot k a singleton and has no singleton in any slot less
than k. In case (a) there are $S_{k-1}(m,n)$ such assignments, and
in case (b) we pick one of the n to be the singleton in slot
k and then assign the rest to the other slots, but subtract all
cases in which a singleton appears in a slot less than k:
$n[(m-1)^{n-1} - S_{k-1}(m-1, n-1)]$. To get $S_k(m,n)$, we sum cases
(a) and (b) and use the induction hypothesis:



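$$S_k(m,n) = \sum_{i=1}^{\min\{k-1,n\}} (-1)^{i+1}\binom{k-1}{i} n^{\underline{i}} (m-i)^{n-i} + n(m-1)^{n-1} - n\sum_{i=1}^{\min\{k-1,n-1\}} (-1)^{i+1}\binom{k-1}{i} (n-1)^{\underline{i}} (m-1-i)^{n-1-i}.$$

Shifting the righthand index of summation up by one and
absorbing the middle summand into the righthand sum as its
i=1 term, we get

$$S_k(m,n) = \sum_{i=1}^{\min\{k-1,n\}} (-1)^{i+1}\binom{k-1}{i} n^{\underline{i}} (m-i)^{n-i} + \sum_{i=1}^{\min\{k,n\}} (-1)^{i+1}\binom{k-1}{i-1} n^{\underline{i}} (m-i)^{n-i}.$$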



Observe that k-1 choose k is zero, so we can let the left hand
sum range up to min {k, n} as well and then combine like
terms, getting

$$S_k(m,n) = \sum_{i=1}^{\min\{k,n\}} (-1)^{i+1}\left[\binom{k-1}{i} + \binom{k-1}{i-1}\right] n^{\underline{i}} (m-i)^{n-i}.$$

Applying the Pascal Triangle identity (see R. Graham, D.
Knuth, O. Patashnik; Concrete Mathematics: A Foundation
for Computer Science; Reading, Mass.: Addison-Wesley;
1989, 1994),

$$\binom{k}{i} = \binom{k-1}{i} + \binom{k-1}{i-1},$$

we get exactly the righthand side as stated in the Theorem.
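The counting argument above can be sanity-checked by exhaustive enumeration for small parameters. The sketch below assumes the inclusion-exclusion form $S_k(m,n) = \sum_{i=1}^{\min\{k,n\}} (-1)^{i+1} \binom{k}{i} n^{\underline{i}} (m-i)^{n-i}$ for the number of assignments leaving a singleton in at least one of the first k slots; the function names are hypothetical, and the brute-force routine visits all $m^n$ assignments, so it is only meant for tiny inputs.

```python
from itertools import product
from math import comb


def falling(n: int, j: int) -> int:
    """Falling power n(n-1)...(n-j+1); equals 1 when j == 0."""
    result = 1
    for i in range(j):
        result *= n - i
    return result


def s_formula(k: int, m: int, n: int) -> int:
    """Assumed closed form for S_k(m, n), with 1 <= k <= m."""
    return sum(
        (-1) ** (i + 1) * comb(k, i) * falling(n, i) * (m - i) ** (n - i)
        for i in range(1, min(k, n) + 1)
    )


def s_bruteforce(k: int, m: int, n: int) -> int:
    """Count assignments of n things to m slots with a singleton in the first k slots."""
    count = 0
    for assignment in product(range(m), repeat=n):
        occupancy = [0] * m
        for slot in assignment:
            occupancy[slot] += 1
        if any(occupancy[s] == 1 for s in range(k)):
            count += 1
    return count


if __name__ == "__main__":
    print(s_formula(2, 3, 4), s_bruteforce(2, 3, 4))
```

Agreement on small cases does not replace the induction proof, but it is a quick way to catch sign or index slips in the formula.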
A.2 Monotonicity of $N_p$

$\forall m \ge 1$, $\forall n \ge 2$, $P(m+1, n) < \max_{0 \le k < n,\ k \ne 1} P(m, n-k)$. Lemma FDD3.

Proof. Observe that in assigning n addresses to m+1 slots, 
we can first choose a subset of the addresses to put into slot 
1 and then assign the rest to the other m slots. Thus, we can 
break up the cases by the number of addresses assigned to 
slot 1. The probability of getting no singletons (in any slot) 
when the resulting assignment has k addresses in slot 1 is 



$$\binom{n}{k} \left(\frac{1}{m+1}\right)^{k} \left(\frac{m}{m+1}\right)^{n-k} P(m, n-k)$$

A.3 Superlinearity of $N_p$

Let $0 < q \le 1$. $\forall m, n$, $1 \le n \le \lfloor qm \rfloor$, $P(m,n) < q$. Lemma FDD5.



Proof. Order the n addresses. Then, in creating an assignment
of the addresses to the m slots, we first assign the first
n-1 addresses arbitrarily to the m slots. Then, if $n \le \lfloor qm \rfloor$,
this leaves at least $(m - \lfloor qm \rfloor + 1)$ empty slots. Thus, the
probability of putting the nth and final address into a
previously empty slot is strictly greater than $(1 - \lfloor qm \rfloor / m)$,
which is greater than or equal to (1-q). Thus, among all
assignments, strictly more than (1-q) of them have a singleton
containing only the nth address, hence the message will
not be screened for those.



$\forall\, 0 \le p \le 1$, $\forall m > 0$, $N_p(m) > pm$. Theorem FDD6.



Proof. Since by definition $P(m, N_p(m)) \ge p$, and since $N_0(m) = 1$,
we apply Lemma FDD5 for the case q=p to conclude the
theorem.



Lemma FDD7. 



except when k=l, in which case there are no no-singleton 
cases (because of the singleton in slot 1). The formula above 
is justified by observing that any such assignment has to first 
choose the k elements, then the random assignments of those
addresses must be to slot 1, then the other n-k addresses 
must each be assigned to one of the other m out of m+1 slots 
in such a way that there are no singletons among them. By 
summing these independent cases, we get 



Proof. This follows straightforwardly from the observation
that $m^{\underline{j}}$ is a polynomial of degree j in m whose leading
coefficient is 1.



$\lim_{m \to \infty} (1 - j/m)^m = e^{-j}$. Lemma FDD8.

(The second equality follows from the Binomial Theorem.)
$X \ge 1/m$, because $\forall m \ge 1$, $P(m, 2) = 1/m$. It follows that
$nm^{n-1} + 1 - 1/X \ge m(nm^{n-2} - 1) + 1$, which is positive if $n \ge 2$, $m \ge 1$. Thus,
P(m+1, n)<X.

For all $0 \le p \le 1$, $m \ge 1$, $N_p(m+1) \ge N_p(m)$. Theorem FDD4.

Proof. Note that $N_0(m) = 1$ for all m, so the p=0 case satisfies
the theorem. For p>0, $\forall j$, if $1 \le j < N_p(m)$ then P(m,j)<p;
hence, by Lemma FDD3, for all such j, P(m+1,j)<p as well.
Since $P(m+1, N_p(m+1)) \ge p$ by definition of $N_p$, we conclude
$N_p(m+1) \ge N_p(m)$.

$$(1 - j/m)^m = \sum_{i=0}^{m} \binom{m}{i} \left(-\frac{j}{m}\right)^{i} = \sum_{i=0}^{m} \frac{m^{\underline{i}}}{m^{i}} \cdot \frac{(-j)^{i}}{i!}$$

By Lemma FDD7, we conclude

$$\lim_{m \to \infty} (1 - j/m)^m = \sum_{i=0}^{\infty} \frac{(-j)^{i}}{i!}$$

The righthand side is just the Taylor Series for $e^x$ evaluated
at x=-j.

Let us define $\alpha(m)$ by the equation $N_p(m) = m\alpha(m)$.
Lemma FDD9. For $\alpha(m)$ as above, integer $k \ge 1$, and for
sufficiently large m,

$$P_k(m, \alpha(m)m) \approx \left(1 - \frac{\alpha(m)}{e^{\alpha(m)}}\right)^{k}$$
Proof. Under the assumptions,



which implies that $P(m, \alpha(m)m)$ goes to zero as m goes
to infinity. But by assumption, $\alpha(m)m = N_p(m)$, so
$P(m, \alpha(m)m) \ge p > 0$ for all m. This contradiction proves that
$\alpha \ne O(1)$.

This proves the theorem.

Appendix B: CF-related Proofs 

$A_p(m) = L\left(1 - (1 - p^{1/m})^{m/L}\right)$. Theorem CF1.

Proof. The probability that at least one of the L/m copies of
a single message version hits the active set is $(1 - (1-a)^{L/m})$.
Thus, the probability that some copy of each of the m
versions hits the active set is

$$p = \left(1 - (1-a)^{L/m}\right)^{m}.$$

Solving for a, we get

$$a = 1 - (1 - p^{1/m})^{m/L}.$$

Since A=La by definition, we've proved CF1.
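The step "solving for a" can be spelled out as follows. This is a worked restatement, assuming the probability that some copy of each of the m versions hits the active set is set equal to p, with a=A/L as defined earlier.

```latex
\begin{align*}
p &= \left(1 - (1-a)^{L/m}\right)^{m} \\
p^{1/m} &= 1 - (1-a)^{L/m} \\
(1-a)^{L/m} &= 1 - p^{1/m} \\
1 - a &= \left(1 - p^{1/m}\right)^{m/L} \\
A = La &= L\left(1 - \left(1 - p^{1/m}\right)^{m/L}\right)
\end{align*}
```

Each line follows from the previous one by taking an m-th root, rearranging, or raising both sides to the power m/L, all of which are valid because every quantity involved lies between 0 and 1.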

Theorem CF2.



Note that the present discussion surrounds behavior for large
m, so we may assume the upper limit of summation in the
formula for $P_k$ is just k, since $n = \alpha(m)m$ grows without
bound.

Now, for large enough m, we can simplify our $P_k$ expression
above:



$$A_p(m) \approx m \ln\left(\frac{1}{1 - p^{1/m}}\right)$$



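Proof. By Theorem CF1,

$$A_p(m) = L\left(1 - (1 - p^{1/m})^{m/L}\right)$$

By the Binomial Theorem,

$$(1 - p^{1/m})^{m/L} = \sum_{i \ge 0} \binom{m/L}{i} \left(-p^{1/m}\right)^{i}$$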



Substituting into the above equation and simplifying, 



Applying the Binomial Theorem to this last expression, we 
get the desired conclusion. 

$N_p(m)/m$ is unbounded as $m \to \infty$. Theorem FDD10.

Proof. Recall that by Theorem FDD6, $N_p(m) = \alpha(m)m > pm > 0$
for all m. We argue by contradiction as follows. Suppose
$\alpha = O(1)$. Then $\exists L$, $0 < L < \infty$, such that $p < \alpha(m) < L$ for all $m > m_0$
for some $m_0 > 0$. Note that for all $x \in [p, L]$, $0 < q \le x/e^x \le 1/e < 1$,
where $q = \min\{p/e^p,\ L/e^L\}$.



Thus, the righthand side expression in Lemma FDD9 goes 
to zero with increasing k. Since as m increases, making the 
approximation better, we can take larger and larger k as well, 
we conclude that for all real numbers $\beta > 0$, for large enough



For $m \ll L$, we can approximate (m-iL) by -iL for all $i \ge 1$:



The parenthesized expression is, however, just the negative
of the Taylor expansion of $\ln(1 - p^{1/m})$. Moving the negation
inside the logarithm, we get exactly the statement of the
Theorem.
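The displayed equations of this proof are not legible in the present text. One way to fill in the intermediate steps, consistent with the surrounding sentences, is sketched below. It assumes the binomial series is applied to $(1 - p^{1/m})^{m/L}$ and that each factor (m - jL) with $j \ge 1$ is replaced by -jL when $m \ll L$; it should be read as a plausible reconstruction rather than the original derivation.

```latex
\begin{align*}
A_p(m) &= L\Bigl(1 - \sum_{i \ge 0} \binom{m/L}{i} \bigl(-p^{1/m}\bigr)^{i}\Bigr)
        = -L \sum_{i \ge 1} \binom{m/L}{i} (-1)^{i} p^{i/m} \\
\binom{m/L}{i} &= \frac{m (m-L)(m-2L) \cdots (m-(i-1)L)}{i!\, L^{i}}
        \approx \frac{m\, (-1)^{i-1} (i-1)!\, L^{i-1}}{i!\, L^{i}}
        = \frac{(-1)^{i-1}\, m}{i\, L} \\
A_p(m) &\approx -L \sum_{i \ge 1} \frac{(-1)^{i-1}\, m}{i\, L}\, (-1)^{i} p^{i/m}
        = m \Bigl(\sum_{i \ge 1} \frac{p^{i/m}}{i}\Bigr)
        = -m \ln\bigl(1 - p^{1/m}\bigr)
        = m \ln\frac{1}{1 - p^{1/m}}
\end{align*}
```

The parenthesized sum in the last line is the series $\sum_{i \ge 1} x^i / i = -\ln(1-x)$ evaluated at $x = p^{1/m}$, which converges because $0 < p^{1/m} < 1$.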

Appendix C: MF-related Proof 
Theorem MF1. For all $m \ge 1$, $k \ge 1$, if an LS(m)-spammer
sends k times to the list (i.e., sends the same set of message
versions k times, but randomly reassigns the addresses to the
m slots each time), then each user receives an average of

$$E(m,k) = m\left(1 - \left(\frac{m-1}{m}\right)^{k}\right)$$

distinct message versions.

Proof. For a given recipient, the total number of possible
assignments of message versions to the k "slots" (message
rounds) is simply $m^k$. The total number of such assignments
involving exactly d distinct message versions, for $1 \le d \le k$,
is

$$D(m,k,d) = {k \brace d}\, m^{\underline{d}},$$

where ${k \brace d}$ represents the number of distinct ways of
partitioning a set of size k into d nonempty subsets (Stirling
Numbers of the Second Kind, as described in R. Graham, D.
Knuth, O. Patashnik; Concrete Mathematics: A Foundation for
Computer Science; Reading, Mass.: Addison-Wesley; 1989,
1994). This is true, because for each distinct partitioning of
the k slots, we then assign any of m message versions to all
slots of the first partition; then for each such assignment any
of m-1 message versions to all slots of the second partition,
etc., down to for each such assignment to the first d-1 slots
assigning any of the remaining m-d+1 message versions to
all slots of the dth partition. The foregoing analysis implies
identity SN1:

$$m^{k} = \sum_{d=1}^{k} {k \brace d}\, m^{\underline{d}}.$$

Also, a simple case analysis yields (see R. Graham, D.
Knuth, O. Patashnik; Concrete Mathematics: A Foundation
for Computer Science; Reading, Mass.: Addison-Wesley;
1989, 1994) identity SN2:

$${k+1 \brace d} = {k \brace d-1} + d\,{k \brace d}.$$

Now, each assignment to the k slots is equally likely,
because LS assigns the message versions uniformly
randomly, so the average number of distinct message versions
per user can be obtained by multiplying the number of
distinct versions received in each case (d) by the probability
of the case ($D(m,k,d)/m^k$) and summing all cases:

$$E(m,k) = \sum_{d=1}^{k} d\, \frac{D(m,k,d)}{m^{k}}.$$

Applying SN2 to the expression for E(m,k), we get

$$\begin{aligned}
m^{k} E(m,k) &= \sum_{d=1}^{k} d\,{k \brace d}\, m^{\underline{d}} \\
&= \sum_{d=1}^{k} \left({k+1 \brace d} - {k \brace d-1}\right) m^{\underline{d}} \\
&= \sum_{d=1}^{k} {k+1 \brace d}\, m^{\underline{d}} - m \sum_{d=1}^{k-1} {k \brace d}\, (m-1)^{\underline{d}} \\
&= \left[\sum_{d=1}^{k+1} {k+1 \brace d}\, m^{\underline{d}} - m^{\underline{k+1}}\right] - m\left[\sum_{d=1}^{k} {k \brace d}\, (m-1)^{\underline{d}} - (m-1)^{\underline{k}}\right] \\
&= m^{k+1} - m^{\underline{k+1}} - m(m-1)^{k} + m(m-1)^{\underline{k}} \\
&= m^{k+1} - m(m-1)^{k}.
\end{aligned}$$

The fourth, fifth, and sixth equalities above follow from
Identity SN1 and the fact that for $n \ge 1$, ${n \brace n} = 1$
by the Pigeonhole Principle. Dividing the last equality
through by $m^k$, we get exactly the statement of the Theorem.

The present invention provides a system and method for
defeating spam countermeasures that employ techniques of
duplicate detection.

What is claimed is:

1. A method for counteracting a filtering technique based
upon the detection of a duplicate, including:

selecting an address on a list;

randomly selecting a first sublist from m sublists, where
m is an integer greater than 1;

determining if the selected address is substantially similar
to an address on the selected first sublist;

if the selected address is not substantially similar to an
address on the selected first sublist, then assigning the
address to the selected first sublist;

creating m different versions of a message; and

sending a different version of the message to each sublist.

2. The method of claim 1, wherein if the selected address
is substantially similar to an address on the selected first
sublist, then:

randomly selecting a second sublist from the m sublists;

determining if the selected address is substantially similar
to an address on the selected second sublist; and

if the selected address is not substantially similar to an
address on the selected second sublist, then assigning
the address to the selected second sublist.

3. A method for counteracting a filtering technique based
upon the detection of a duplicate, including:

assigning an address on a list to one of m sublists, where
m is an integer greater than 1;

creating m different versions of a message including a
second version of the message different from a first
version of the message, the second version created by
performing steps including creating a syntactically different
version of a paragraph of the first version, and
including the syntactically different version of the
paragraph of the first version in the second version; and

sending a different version of the message to each sublist.

4. A method for counteracting a filtering technique based
upon the detection of a duplicate, including:

assigning an address on a list to one of m sublists, where
m is an integer greater than 1; 

creating m different versions of a message including a 
second version of the message different from a first 
version of the message, the second version created by 
performing steps including creating a syntactically dif- 
ferent version of a sentence of the first version, and 
including the syntactically different version of the 
sentence of the first version in the second version; and 

sending a different version of the message to each sublist. 

5. A method for counteracting a filtering technique based 
upon the detection of a duplicate, including: 

assigning an address on a list to one of m sublists, where 
m is an integer greater than 1; 

creating m different versions of a message including a 
second version of the message different from a first 
version of the message, the second version created by 
performing steps including creating a syntactically dif- 
ferent version of a string of the first version, and 
including the syntactically different version of the 
string of the first version in the second version; and 

sending a different version of the message to each sublist. 

6. A method for counteracting a filtering technique based 
upon the detection of a duplicate, including: 

assigning an address on a list to one of m sublists, where
m is an integer greater than 1;
creating m different versions of a message; and 
sending a different version of the message to each sublist, 
including sending a first version of the message to at 
least one address on a first sublist and sending a second 
version of the message to at least one address on a 
second sublist. 

7. An apparatus for counteracting a filtering technique 
based upon the detection of a duplicate, comprising: 

a processor; 

a memory storing filter countermeasure instructions
adapted to be executed by said processor to select an
address on a list, to randomly select a first sublist from
m sublists, m being an integer greater than 1, to
determine if the selected address is substantially similar
to an address on the selected first sublist, and, if the
selected address is not substantially similar to an
address on the selected first sublist, then to assign the
address to the selected first sublist, to create m different
versions of a message, and to send a different version
of the message to each sublist, said memory coupled to
said processor; and

a port adapted to be coupled to a network, said port 
coupled to said processor. 

8. The apparatus of claim 7, wherein the selected address
is substantially similar to an address on the selected sublist,
and wherein said filter countermeasure instructions are
adapted to be executed by said processor to:

randomly select a second sublist from the m sublists;
determine if the selected address is substantially similar to
an address on the selected second sublist;

if the selected address is not substantially similar to an
address on the selected second sublist, then to assign
the address to the selected sublist.

9. A medium storing filter countermeasure instructions
that are adapted to be executed by a processor to perform
steps including:

selecting an address on a list;

randomly selecting a first sublist from m sublists, where
m is an integer greater than 1;
determining if the selected address is substantially similar
to an address on the selected first sublist;
if the selected address is not substantially similar to an
address on the selected first sublist, then assigning the
address to the selected first sublist;
creating m different versions of a message; and

sending a different version of the message to each sublist.

10. The medium of claim 9, wherein the selected address
is substantially similar to an address on the selected sublist,
and wherein said medium stores filter countermeasure
instructions adapted to be executed by a processor to:
randomly select a second sublist from the m sublists;
determine if the selected address is substantially similar to
an address on the selected second sublist; and
if the selected address is not substantially similar to an
address on the selected second sublist, then assign the
address to the selected second sublist.